On Composing Matrix Multiplication from Kernels
Authors
Abstract
Matrix multiplication is often treated as a basic unit of computation in terms of which other operations are implemented, yielding high performance. In this paper, initial evidence is provided that there is a benefit to exposing the lower-level kernels from which matrix multiplication itself is composed. In particular, it is shown that matrix multiplication can be coded at a high level of abstraction as long as interfaces to such low-level kernels are exposed. Furthermore, it is shown that higher-performance implementations of parallel matrix multiplication can be achieved by exploiting access to these kernels. Experiments on Itanium2 and Pentium 4 servers support these insights.
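As a rough illustration of this kind of decomposition, the C sketch below writes a blocked matrix multiplication at a high level in terms of a lower-level "block times panel" kernel. The kernel name gebp_kernel, the packing routine pack_A, and the blocking parameters MC and KC are assumptions made for the example; they mirror the structure the abstract describes, not the paper's actual interface.

/* Illustrative sketch only: blocked C += A*B (column-major), written at a
 * high level in terms of a lower-level kernel.  All names and blocking
 * parameters here are hypothetical, not the paper's actual API. */
#include <stdlib.h>
#include <string.h>

#define MC 128   /* rows of the packed A block    (assumed value) */
#define KC 256   /* columns of the packed A block (assumed value) */

/* Low-level kernel: C(mc x n) += Apacked(mc x kc) * B(kc x n).
 * In a real library this is the highly tuned, architecture-specific part. */
static void gebp_kernel(int mc, int n, int kc,
                        const double *Apacked, const double *B, int ldb,
                        double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < kc; ++p) {
            double bpj = B[p + j * ldb];
            for (int i = 0; i < mc; ++i)
                C[i + j * ldc] += Apacked[i + p * mc] * bpj;
        }
}

/* Pack an mc x kc block of A (leading dimension lda) contiguously. */
static void pack_A(int mc, int kc, const double *A, int lda, double *Apacked)
{
    for (int p = 0; p < kc; ++p)
        memcpy(&Apacked[p * mc], &A[p * lda], mc * sizeof(double));
}

/* High-level routine: loops over blocks of A and delegates all of the
 * performance-critical work to the exposed low-level kernel. */
void gemm(int m, int n, int k,
          const double *A, int lda,
          const double *B, int ldb,
          double *C, int ldc)
{
    double *Apacked = malloc((size_t)MC * KC * sizeof(double));
    for (int pc = 0; pc < k; pc += KC) {
        int kc = (k - pc < KC) ? k - pc : KC;
        for (int ic = 0; ic < m; ic += MC) {
            int mc = (m - ic < MC) ? m - ic : MC;
            pack_A(mc, kc, &A[ic + pc * lda], lda, Apacked);
            gebp_kernel(mc, n, kc, Apacked, &B[pc], ldb, &C[ic], ldc);
        }
    }
    free(Apacked);
}

The point of the sketch is that only gebp_kernel needs architecture-specific tuning; once such a kernel interface is exposed, the outer loops can stay at a high level of abstraction.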
Similar resources
When to Cache Block Sparse Matrix Multiplication: A Statistical Learning Approach
In previous work it was found that cache blocking of sparse matrix-vector multiplication yielded significant performance improvements (up to 700% on some matrix and platform combinations); however, deciding when to apply the optimization is a non-trivial problem. This paper applies four different statistical learning techniques to explore this classification problem. The statistical techniques use...
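A minimal sketch of the underlying optimization, assuming the matrix has been split into column blocks so the slice of x touched by each block stays cache-resident, is shown below. The CsrBlock layout and block representation are assumptions made for illustration; this is not the referenced paper's implementation, nor its learning-based decision procedure.

/* Sketch of cache-blocked sparse matrix-vector multiplication: the matrix
 * is stored as one CSR sub-matrix per column block (a hypothetical layout
 * chosen for this example). */
typedef struct {
    int     nrows;
    int     col_start;   /* first column covered by this block     */
    int    *rowptr;      /* CSR row pointers, length nrows + 1     */
    int    *colidx;      /* column indices relative to col_start   */
    double *val;         /* nonzero values                         */
} CsrBlock;

/* y += A*x, where A is split into nblocks column blocks. */
void spmv_cache_blocked(const CsrBlock *blocks, int nblocks,
                        const double *x, double *y)
{
    for (int b = 0; b < nblocks; ++b) {
        const CsrBlock *blk = &blocks[b];
        const double *xb = x + blk->col_start;   /* cache-resident slice */
        for (int i = 0; i < blk->nrows; ++i) {
            double sum = 0.0;
            for (int k = blk->rowptr[i]; k < blk->rowptr[i + 1]; ++k)
                sum += blk->val[k] * xb[blk->colidx[k]];
            y[i] += sum;
        }
    }
}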
A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure
The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication which could run on a Fibonacci Hypercube structure. Most of the popular algorithms for parallel matrix multiplication cannot run on a Fibonacci Hypercube structure; therefore, a method that can run on all structures, especially the Fibonacci Hypercube structure, is necessary for parallel matr...
A Fast GEMM Implementation On a Cypress GPU
We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ∼ 2 Tflop/s and ∼ 470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical performance of the GPU, respectively. Cu...
Accelerating GPU Kernels for Dense Linear Algebra
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU specific optimiz...
Compiler-Optimized Kernels: An Efficient Alternative to Hand-Coded Inner Kernels
The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce ...
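To make the "parameterized at compilation time" idea concrete, the sketch below fixes the register-blocking factors of a simple matrix multiplication inner kernel as compile-time constants so the compiler can fully unroll and vectorize the loops. The parameter names MR and NR and their default values are assumptions for this example, not the authors' actual codes.

/* Hypothetical compile-time-parameterized inner kernel:
 * C(MR x NR) += A(MR x kb) * B(kb x NR), with A and B packed column-major
 * (leading dimensions MR and kb respectively). */
#ifndef MR
#define MR 4      /* register-block rows    (compile-time parameter) */
#endif
#ifndef NR
#define NR 4      /* register-block columns (compile-time parameter) */
#endif

void micro_kernel(int kb, const double *A, const double *B,
                  double *C, int ldc)
{
    double acc[MR][NR] = {{0.0}};            /* accumulators held in registers */
    for (int p = 0; p < kb; ++p)
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                acc[i][j] += A[i + p * MR] * B[p + j * kb];
    for (int j = 0; j < NR; ++j)
        for (int i = 0; i < MR; ++i)
            C[i + j * ldc] += acc[i][j];
}

A build script could then compile this file several times with different parameters (for example, cc -O3 -DMR=8 -DNR=4 -c kernel.c) and time the resulting variants to pick the best one for a given machine; this usage pattern is an assumption illustrating the general approach.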